This notebook is the first in a sequence of steps for running machine learning on the cloud. This step covers data preparation and preprocessing, and mirrors the equivalent portions of the local notebook.
In [1]:
import google.datalab as datalab
import google.datalab.ml as ml
import mltoolbox.regression.dnn as regression
import os
The storage bucket we create will, by default, be named using the project id.
In [2]:
storage_bucket = 'gs://' + datalab.Context.default().project_id + '-datalab-workspace/'
storage_region = 'us-central1'
workspace_path = os.path.join(storage_bucket, 'census')
# We will rely on outputs from data preparation steps in the previous notebook.
local_workspace_path = '/content/datalab/workspace/census'
In [ ]:
!gsutil mb -c regional -l {storage_region} {storage_bucket}
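The bucket only needs to be created once; if you re-run this notebook, the mb command will report that the bucket already exists, which is safe to ignore. If you prefer, you can check for the bucket first (an optional sanity check, not required for the workflow):

!gsutil ls -b {storage_bucket}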
NOTE: If you have previously run this notebook, and want to start from scratch, then run the next cell to delete previous outputs.
In [ ]:
!gsutil -m rm -rf {workspace_path}
To get started, we will copy the data into this workspace from the local workspace created in the previous notebook.
Generally, in your own work, you will have existing data that you may or may not need to copy around, depending on where it currently lives.
In [5]:
!gsutil -q cp {local_workspace_path}/data/train.csv {workspace_path}/data/train.csv
!gsutil -q cp {local_workspace_path}/data/eval.csv {workspace_path}/data/eval.csv
!gsutil -q cp {local_workspace_path}/data/schema.json {workspace_path}/data/schema.json
!gsutil ls -r {workspace_path}
In [6]:
train_data_path = os.path.join(workspace_path, 'data/train.csv')
eval_data_path = os.path.join(workspace_path, 'data/eval.csv')
schema_path = os.path.join(workspace_path, 'data/schema.json')
train_data = ml.CsvDataSet(file_pattern=train_data_path, schema_file=schema_path)
eval_data = ml.CsvDataSet(file_pattern=eval_data_path, schema_file=schema_path)
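If you want to double-check the schema that both datasets will use (column names and types), you can print the schema file directly. This is just a quick sanity check:

!gsutil cat {schema_path}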
When building a model, a number of pieces of information about the training data are required - for example, the list of entries (the vocabulary) of a categorical/discrete column, or aggregate statistics such as the min and max of numerical columns. Gathering these requires a full pass over the training data; it is usually done once, and only needs to be repeated if you change the schema in a future iteration.
On the cloud, this analysis is done with BigQuery, which references the CSV data in Cloud Storage as external data sources. The output of the analysis is written back to storage.
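To build intuition for what the analysis produces, here is a minimal local sketch of the same idea using pandas. The column names are hypothetical, and the real analysis is handled for you by BigQuery in the analyze() call below:

import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({
    'age': [25, 38, 52, 41],                             # numerical column
    'occupation': ['sales', 'tech', 'sales', 'admin'],   # categorical column
})

# Aggregate statistics for a numerical column (min/max, etc.).
numeric_stats = {'min': df['age'].min(), 'max': df['age'].max()}

# Vocabulary (list of distinct entries) for a categorical column.
vocab = sorted(df['occupation'].unique())

print(numeric_stats, vocab)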
In the analyze() call below, notice the use of cloud=True to move the data analysis from running locally to running in the cloud.
In [7]:
analysis_path = os.path.join(workspace_path, 'analysis')
regression.analyze(dataset=train_data, output_dir=analysis_path, cloud=True)
As in the local notebook, the output of analysis is a stats file containing statistics for the numerical columns, and a vocabulary file for each categorical column.
In [8]:
!gsutil ls {analysis_path}
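Each categorical column gets its own vocabulary file in the analysis output. Assuming the files use a vocab_ prefix (the exact naming may differ across toolbox versions), you can list them with a wildcard and then cat the one you are interested in:

!gsutil ls {analysis_path}/vocab_*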
Let's inspect one of the files - in particular the numerical analysis, since it also tells us some interesting statistics about the income column, the value we want to predict.
In [9]:
!gsutil cat {analysis_path}/stats.json
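If you would rather work with these statistics in Python than read raw JSON, you can copy the file down and load it into a dictionary. Treat this as a quick sketch; the exact keys in stats.json depend on the toolbox version:

import json

# Copy the stats file to a local temp path and load it.
!gsutil -q cp {analysis_path}/stats.json /tmp/stats.json
with open('/tmp/stats.json') as f:
    stats = json.load(f)

print(json.dumps(stats, indent=2))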
This notebook completed the first steps of our machine learning workflow: data preparation and analysis. The data and the analysis outputs will be used to train a model, which is covered in the next notebook.